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I. ABSTRACT 



o 

o . 

. Motivation: Transcription networks, and other directed networks can be characterized by some topological observables such 

' as for example subgraph occurrence (network motifs). In order to perform such kind of analysis, it is necessary to be able to 

^ generate suitable randomized network ensembles. Typically, one considers null networks with the same degree sequences of the 

'"J original ones. The commonly used algorithms sometimes have long convergence times, and sampling problems. We present 

T— I here an alternative, based on a variant of the importance sampling Montecarlo developed by Chen et at. y]]. 



O: 

d 
•1— I 

cr: 



> 



Availability: softwares are available at |http://wwwteor.mi.infn.it~bassetti/downloads.html| 
Contact: marco.cosentino-lagomarsino@curie.fr, diana.fusco@studenti.unimi.it 

Supplementary Information: supplementary notes are available at |http://wwwteor.mi.infn.it~bassetti/downloads.html| 

II. INTRODUCTION 



Gene regulatory networks are graphs that represent interactions between genes or proteins. For example, in a transcription 
QQ network the nodes are genes or operons, identified with thek protein products, and the edges represent their transcriptional 
^— I regulatory regions along DNA |[l2ll . The simplest possible approach to study them is to consider their topology. The main 
T— I biological question that underlies these studies asks to establish to what extent the empirical biological topology deviates from 
a "typical case" statistics. In order to do that, one generates so called "randomized counterparts" of the original data set as a 
\^ null model. That is, an ensemble of random networks which conserve some topological observables of the original, such as the 
degree sequences, i.e. the number of outgoing and incoming links for each node. This approach has a wider application for 
C networks of different kinds |4, 10]. A directed network can be conveniently represented as a zero-one adjacency matrix where 

element is 1 if node j has a directed link pointing to node i (Fig.lA). The null ensemble of degree-conserving graphs 
J> translates into a set of matrices having the same row and column sums of the original matrix. Some algorithms to generate this 
uniformly distributed ensemble are commonly used IHIst]. In particular, one Markov Chain Montecarlo (MCMC) algorithm is 
based on swapping edges at random |6]. This generates an ergodic dynamics, with, however, large relaxation times to a uniform 
^ distribution. Another type of algorithm is the so called "stub-pairing" or Molloy-Reed algorithm [5,6], that consists in randomly 
■ ■ ' linking "stubs" made of nodes with required in- and outdegrees, in order to build a randomized instance 10, il. WhUe useful, 
this technique may fall in metastable states, where no stubs can be connected. The algorithm developed by Chen et al. is 
more efficient than the MCMC one [1] and does not run the risk of falling in metastable states. It is based on an application of 
importance sampling Montecarlo. It generates matrices with an almost uniform probability, and subsequently adjusts the sample, 
assigning to every element a certain weight. Finally, it is able to estimate the size of the sampled ensemble. 

Here, we present an implementation of this algorithm that works specifically on transcription networks, with two variants. 
The first variant is designed to improve the speed of the algorithm. The second variant enables to deal with ensembles of 
structured matrices, in particular with structured diagonal, as it is often done in transcription networks when dealing with self- 
regulations [4]. 
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m. ALGORITHM 



As the goal is the uniform distribution of the sample, the importance sampling weight for every element is 1/P{T), where 
P(r) is the matrix probability. The algorithm is illustrated in Fig.lA. The matrix is generated by filling column after column. 
Suppose, for example, the first column has already been generated and the second one (in pink in Fig.lA) must be extracted. 
One has to consider the row sums having subtracted the first column. At this point, one can compute a "constraint" inside 
the column in order to allow the algorithm not to fall in metastable states (Fig.lA and Subsequently, the constraint-free 
positions are filled with a probability that can be computed exactly [2]. In order to perform this operation, the row sums need 
to be ordered by rank. When all the columns are filled, the total probability of having a certain matrix is the product of all 
the column probabilities, which can be computed knowing the constraints of each column |2]. This number allows to weigh 
correctly the matrix sample. 

We introduced the following two variants. 

Large matrices with compact indegree Transcription networks typically have several hundreds of nodes. The computational 
cost for generating a column is of order O(M^c^) where M is the length of a column and c the number of Is contained in that 
column 1 1]. This is due to the fact that every time that a position must be selected, the algorithm has to evaluate the probability 
of success for every position inside the column |3]. 

We have demonstrated that the probability of success in a given position can be well approximated using the corresponding 
row-sum if the in-degree distribution is sufficiently limited in range. This last feature is typical of transcription networks. 
Consequently, as the probability of having a certain zero-one sequence does not depend on the order of extraction, it can be 
evaluated only once for every column, or, better, for each constraint. The computational cost for generating a column is then 
reduced to order 0{Mc). 

Structured diagonal Self-regulatory interactions are often considered to have a particular status (5]. They are represented in 
the matrix by 1 on the diagonal [4J. In order to constrain the diagonal, one has to modify the way the algorithm calculates the 
constraints inside the columns, accounting for the fact that some positions are not available for the extraction. 



IV. IMPLEMENTATION AND RESULTS 



Triangular network motifs As an example of application we have studied the occurrence of three triangular subgraphs 
(Fig.lC and ID). The FFL (Feed Forward Loop), SIM (triangular Single Input Module) and TGC (Three Gene ChainX for the 
transcription networks of E. coli lITsIl and S. Cerevisiae fisll verifying the results that can be found in the literature |0, fioll . 

In all cases, we find a quantitative difference between the subgraph distributions in the randomized ensembles with or without 
structured diagonal (Fig. IC and ID). In some instances, such as the biologically relevant FFL |4], this does not affect the decision 
of whether that subgraph is a motif. In other cases one can also find qualitative changes. This difference is more visible in E. 
Coli as sixty percent of its nodes are autoregulated, and less in S. Cerevisiae with only ten percent of autoregulations. 

Feedback We also evaluated (Fig. IC) the feedback in the graph, using a simple decimation algorithm that removes the input- 
and output- treelike components [11]. With this algorithm, the feedback is measured by the size Mcore of the decimated graph. 
We have ignored autoregolations. As expected, the sample with structured diagonal is shifted towards smaller amounts of 
feedback. This can be explained considering the lower amount of available links to rearrange if the selfregulators are fixed. 



V. CONCLUSIONS 



In conclusion, we have implemented a Montecarlo importance sampling algorithm to randomize directed graphs conserving 
the degree sequence, and evaluate topological observables. The algorithm follows the design principles of Chen et ai, and is de- 
signed to be more efficient without loss of uniformity on graphs with compact indegree such as the known transcription networks. 
Furthermore, we added a variant that works with constrained diagonal, as is usually done in motif discovery P]. We implemented 
the code in a simple three-node motif and feedback finder, that reproduces the results known in the literature. The version of the 
running code (in C^^) used for our analysis is publicly available at http://wwwteor.mi.infn.it~bassetti/downloads.html , and 
can be inserted in more general motif finding tools. 
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VI. ADDITIONAL NOTES ON THE IMPORTANCE SAMPLING RANDOMIZER FOR TRANSCRIPTION NETWORKS 



A. Introduction 



The purpose of these notes is to introduce and describe two modifications of the importance sampling randomization algorithm 
for directed graphs introduced in fll]. The sample of randomized graphs to be generated has to be uniform in the set of graphs 
having the same degree sequences as the original one, i.e. conserving the number of incoming and outgoing edges for each 
node |5]. These modifications are produced keeping in mind two important features of transcriptional regulation networks. The 
first is that these graphs have compact indegree. For example in the case of the Shen-Orr [5J data-set for the E. coli transcription 
network, a graph with about 400 nodes and 600 edges, the maximum indegree is of order 10, while the maximum outdegree 
has order 100. The second feature is that networks may have an abundance of self-interactions (this is the case for example in 
E. coli). For this reason, one may wish to consider randomizations that conserve the number of self-interactions, i.e. having 
structured diagonal in the adjacency matrix (see below). 



B. Summary of the Procedure 



A directed graph can be represented by an adjacency matrix G where the element indexed by (i, j) is 1 if gene j influences 
gene i, and otherwise. Row sums of the matrix represent the number of nodes receiving edges from each node (outdegree), 
column sums represent the number of nodes sending edges to each node (indegree). Consequently, generating randomized 
networks with fixed in- and outdegree is equivalent to generating randomized matrices with constant row- and column sums. 

The algorithm of Chen et al.\\\ has this scope, and achieves it using the Montecarlo importance sampling method: every 
matrix is generated column by column and is then weighted inside the sample with a certain analytically calculated weight. This 
weight consists in the inverse of the probability that the matrix is generated by the algorithm. The calculation of the matrix 
probability is a crucial point. It is performed using the conditional Poisson distribution [2]. This distribution allows to compute 
the probability of having a 0-1 sequence of length N with the constraint of having n nonzero entries. A key role is played by the 
function 

r 1 if fc = 

7^(fc, C) = <^ EscC,|B|=fc(aeB if < /c < \C\ (1) 
[O iffc>|C| 

where C is the set of the possible positions in the sequence (in this case C = {1, iV} and Wi is the weight assigned to position 
i. When a column is generated, this weight is Vi/N, where is the ith row sum. 

Suppose now that the positions where 1 are put are extracted one by one and that Ak is the set that contains the positions 
chosen after the fcth extraction. At the beginning — 0. Then at the fcth step the position j e Ak-i will be extracted with 
probability 

/■AC ~ ^fe-l\j) 

^^•^'^-1^ " {n-k + l)n{n-k+\,Al_^) ' 
where Wj — jzi^ is the weight assigned to position j. 



C. Large Matrices 



The first problem we had to face was due to the dimensions of our matrices. The networks we considered typically had about 
500 nodes, consequently the associated matrix is 500 x 500. With these number, the algorithm of Chen et. al. is too slow to 
generate a significant sample in reasonable time. Now, most of the computing time is required by the calculation of 7?.(c, A). 

To avoid this problem, we use the following method. Suppose that the ^th column is being generated, and it has to contain 
c edges, or units. Then for every row with r'^P ^ 0, at least the numerator (it depends on j) of Eq.|2]must be calculated. The 
denominator is a common factor to all the rows, and is not important at this step. A similar calculation has to be performed for 
every placement from 1 to c. The process for calculating TZ{c—k, has a computational cost of order C'((c—fc)*|A^_-^ |) l^]. 
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This calculation must be repeated for every available position j that is of order To avoid repeating the process for all 

the c extractions, we approximate A'f,_^ with the number of rows M (typically about 500) of the matrix, for every k, then the 
cost for generating a column becomes of order 0{M^c^). In other words, approximating the probability of selection of a certain 
position with its row sum, the algorithm should calculate the function TZ only once for every column, reducing considerably the 
computational cost. We will now argue that this approximation is acceptable for graphs with "small" indegree. 

The probability of selecting a string {Ai,i — ai, Ai^j — om) with prescribed sum does not depend on the extraction order. 
It simply writes 

M ■r-.M at 

P{\i = ai, Al, M = aM\ V a. = c) = ^^f^ (3) 

1=1 l-C{c,i^) 

where C is the set that represent the whole column. This means that for evaluating this probability one does not have to keep 
into account the whole process of extraction. However, the problem of making a good extraction still persists. In fact, even if 
the calculation of the sequence probability is correct, nothing assures that this sequence has been extracted with the conditional 
Poisson distribution. First of all note that the statistical meaning of 7?,(n, C) is: 

where the random variable Sc — T^^Li '^i- Consequently, if we compare the probabilities of extracting the position i and the 
position j at the fcth step, they can be written as 

{ p{i) cx riP{Si ^ c~ k) 
\p{j) cx r,P{S,^c-k), 

where Si stands for the sum of the elements of \i and Sj stands for the sum of the elements of A'f,_- \j. 
Now, note that 

P(5, = c-fc)= Hp- lii^-Py) ■ (6) 

BcC\i,\B\=c-kxeB yeB'= 

Among all the sets B, there will be some that contain j. Equivalently, for Sj, there will be some sets B containing i and some 
not containing it. As the sum runs over all the possible subsets, we can write it as follows 



P{Si = C - k) = Pj J2BcC\i,\B\=c-k,jeB TlxeB,x=^j Px Yiyi^B^i^ ^ Py) + 
^ + i^~Pj)J2BcC\i,\B\=c-k,j^BYlxeBPxYlyeB'',y^ji^~Pv) ^j-^ 

P{Sj = c-k) = PiJ2Bcc\].\B\=c-k,ieBllxi£B,x^iPxIlyeB''i^ " Py)'^ 

+ {i—Pi)J2BCC\j.\B\=c—k,ii^BUxeBP=i:UyeB':,y^i(^'~Py) 

Note that the factor multiplying pj in the first equation is the same as the factor multiplying pi in the second (21). The same 
happens for 1 — pj and 1 — Pi(5B). Thus, we can rewrite equation|2]as 



P(5,=c-fc) - p,2l+(l-p,)Q5 

PiS, =c-k) = pM+il-p^)'B ■ ^'^^ 



If we now consider the difference between the two equations 



P{S, =c-k)- P{Sj ^c-k) = [pj - p,)(2[ - ») , (9) 

we see that |2l — Q5| < 1, as separately 21 ^ 1 and *B < 1. Now pj — pi = ^p^^i where and rj ai'e the updated row sums 
(updated after the genration of the previous I — 1 columns). Then it is easy to see that \pj ~pi \ < where r,nax = maxiri. 

This is due to the fact the worst situation is when for example = 1 and Vj — N — I — 1. As rj < r„iax the most approximated 
step is when N — I = r,nax- This explains why smaller values of r,nax lead to a better approximation. The probability of being 
in this situation is proportional to the probability that, after TV — Vmax generated columns, the column with the maximum row 
sum is empty apart form one unit. In order to estimate it roughly, we consider the rows as independent and approximate the 
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row distribution with a Bernoulli distribution with probability of success Vmax/N, then the probability of having a sequence of 
N — Tjnax zeros is estimated as: 

p (-^ ^max) rnax 

This probability decreases if r^ax increases. For example, for the E. Coli graph, it is equal to 0.00257, as = 423 and 
rmax = 6. This gives a rough estimate of the maximum error. 



D. Constrained Diagonal 



Self-interactions (units on the diagonal of G) have particular status in transcription networks [01. For this reason, it is in- 
teresting to consider randomized ensembles where the diagonal is constrained. The problem is then how to make the diagonal 
inaccessible for the algorithm column-filling steps, and in particular, how to calculate the constrains inside the columns. 

First, we note that the positions above the diagonal behave as in the previous case. The problem restricts to are the positions 
below the diagonal. The algorithm to find the constraints inside every column can be summarized as follows. 

1 . Order the position ranking them from the highest row-sum to the lowest. 

2. If two or more positions have the same row-sums, the positions below the diagonal must be placed first. 

3. Among the positions below the diagonal having the same row sums, a precise order must be followed. Suppose that after 
the previous ordering step row i occupies position Pi. Then the rows with the lowest difference \ci — Pi \ have the priority. 

4. Let be the vector of positions before the ordering step, i.e. the row occupying now position j is the row that occupied 
position P^^ before reordering. Considering the difference J2i=i "X^iLi cp^*, one unit must be subtracted if Cp-i > k 

•' k 

and if P^T^ is under the diagonal. 

5. When k becomes large enough so that for some i Cp-i < k, one unit for every i must be summed. This must be done only 
if previously one unit had been subtracted for that positions. 

In this way the two vectors K and v identifying the constrains inside the columns will take into account the inaccessibility of 
the diagonal. Finally, while placing the units inside the columns, it must be kept in mind that the positions of the diagonal are 
not accessible. This must be considered also when assigning the weights to every row and the probability of having a certain 
number of units before every constraint. 
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FIG. 1 : A) Description of tiie algoritiim. Top: tlie grapli is translated into an adjacent matrix, which is filled column by column. Middle/Bottom: 
the procedure for generating a matrix is showed. The second column (in pink) must be generated. In order to perform this operation, the updated 
row sums and the constraint are calculated. The free positions are extracted and the algorithm considers the next (and last) column. Bottom: 
the column is then reordered by the residual row sums. One constraint is found and the last column filled. The matrix is now complete. B) 
Performance and uniformity of the sample. Top: comparison of the computational cost of our variant with the algorithm of Chen et ah. We 
plotted (in logarithmic scale) the time to generate 100 randomizations of networks of different size (a random subgraph of the E. Coli network, 
the entire E. Coli network, the B. Subtilis network[14]). The solid line (blue) represents the algorithm of Chen et ai, the dashed line (purple) 
represents our modified algorithm. Bottom: weight distribution of the two algorithms for a sample of 1000 randomizations of the E. Coli 
network. The narrower the distribution of weights, the better the algorithm approximates a uniform sample. Our modified algorithm (light 
green) tends to generate matrices that have a higher probability of extraction than the other one (dark blue). This ex plai ns why its distribution 
is narrower. C) Table summarizing the results for three triangular motifs and the feedback (measured by Mmr-p fllll of the E. coli and S. 
Cerevisiae transcription networks. Note that the E. Coli dataset has no feedback. D) Comparison of subgraph (FFL, TGC) distributions in 
E. coli randomizations for structured (light green) and unstructured (dark blue) diagonals. These distributions are systematically shifted with 
respect to each other. 



